Approximate Policy Iteration using Large-Margin Classifiers
Authors
Abstract
We present an approximate policy iteration algorithm that uses rollouts to estimate the value of each action under a given policy in a subset of states and a classifier to generalize and learn the improved policy over the entire state space. Using a multiclass support vector machine as the classifier, we obtained successful results on the inverted pendulum and the bicycle balancing and riding domains.
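The loop described in the abstract (rollouts to score each action in a subset of states, then a classifier to generalize the improved policy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy chain domain stands in for the pendulum and bicycle tasks, a one-nearest-neighbour rule is swapped in for the multiclass support vector machine, and all names (`step`, `rollout_value`, `fit_nearest_neighbour`, `approximate_policy_iteration`) are hypothetical rather than taken from the paper.

```python
# Toy domain (an assumption, not from the paper): a chain of states 0..10
# with a goal in the middle. Reaching the goal yields reward 1 and ends the episode.
N_STATES = 11
GOAL = 5
ACTIONS = (-1, +1)   # move left or move right
GAMMA = 0.9


def step(state, action):
    """Deterministic toy dynamics; positions are clamped to the chain."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def rollout_value(state, action, policy, horizon=30):
    """Discounted return of taking `action` once, then following `policy`."""
    total, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        s, reward, done = step(s, a)
        total += discount * reward
        if done:
            break
        discount *= GAMMA
        a = policy(s)
    return total


def fit_nearest_neighbour(examples):
    """Stand-in classifier: copy the label of the closest labelled state.
    (The paper uses a multiclass SVM here; 1-NN keeps the sketch dependency-free.)"""
    def policy(state):
        _, label = min(examples, key=lambda ex: abs(ex[0] - state))
        return label
    return policy


def approximate_policy_iteration(iterations=3, margin=1e-6):
    policy = lambda s: -1                                      # poor initial policy: always left
    sample_states = [s for s in range(0, N_STATES, 2) if s != GOAL]  # a subset of states
    for _ in range(iterations):
        examples = []
        for s in sample_states:
            q = {a: rollout_value(s, a, policy) for a in ACTIONS}
            best = max(q, key=q.get)
            runner_up = max(v for a, v in q.items() if a != best)
            if q[best] - runner_up > margin:                   # keep only clearly dominant actions
                examples.append((s, best))
        if not examples:                                       # nothing learned; keep the old policy
            break
        policy = fit_nearest_neighbour(examples)               # generalize to all states
    return policy
```

Because the toy dynamics are deterministic, one rollout per state-action pair suffices here; in a stochastic domain the estimates would be averaged over several rollouts, and the dominance check would need to account for sampling noise.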
Similar resources
Approximate modified policy iteration and its application to the game of Tetris
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are exten...
arXiv:0805.2027v1 [cs.LG] 14 May 2008 Rollout Sampling Approximate Policy
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which...
A Convergent Form of Approximate Policy Iteration
We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with ...
Algorithms and Bounds for Sampling-based Approximate Policy Iteration
Several approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem, have been proposed recently. Finding good policies with such methods requires not only an appropriate classifier, but also reliable examples for the best actions, covering all of the state space. One major ques...
First Iteration Policies for Admission Control in Multiaccess Networks
This work explores approximate methods to solve Markov decision processes for large systems through policy iteration. Two methods, one using an embedded discrete-time Markov chain and the other using time-scale separation, are defined and compared with the solution obtained using traditional policy iteration. First step solutions are found and compared for a radio resource management problem wit...